Abstract

The bag-of-features (BoF) model for image classification has been thoroughly studied over the last decade. In contrast to the widely used BoF methods, which model images with a pre-trained codebook, the alternative codebook-free image modeling approach, which we call the Codebookless Model (CLM), has attracted little attention. In this paper, we present an effective CLM that represents an image with a single Gaussian for classification. By embedding the Gaussian manifold into a vector space, we show that simply incorporating our CLM into a linear classifier achieves accuracy that is very competitive with state-of-the-art BoF methods (e.g., Fisher Vector). Since our CLM lies in a high-dimensional Riemannian manifold, we further propose a method for jointly learning a low-rank transformation and a support vector machine (SVM) classifier on the Gaussian manifold, in order to reduce computational and storage cost. To study and alleviate the side effect of background clutter on our CLM, we also present a simple yet effective partial background removal method based on saliency detection. Extensive experiments on eight widely used databases demonstrate the effectiveness and efficiency of our CLM.


1. Introduction 


Image classification has attracted massive attention in the computer vision and pattern recognition communities in recent years. It is one of the most fundamental yet challenging vision problems because images, as illustrated in Fig. 1, often suffer from significant scale, view or illumination variations (e.g., in texture classification [8] and material recognition [22]), as well as pose changes, background clutter and partial occlusion (e.g., in scene categorization and object recognition).

For a long time the bag-of-features (BoF) model has been the dominant approach to image classification. As shown in Fig. 2(a), BoF-based methods generally consist of five components: extracting local features, learning a codebook from training data, coding local features with the pre-trained codebook, pooling or aggregating codes over images, and finally learning a classifier (e.g., SVM) for classification. With this processing pipeline, BoF-based methods can be seen as a hand-crafted five-layer hierarchical feed-forward network [43] with a pre-trained feature coding template (the codebook). The learned codebook depicts the distribution of the feature space and makes coding of high-dimensional features possible. This architecture has achieved very promising performance in a variety of image classification tasks.

Figure 1. Some example images and comparison (in %) between Fisher vector (FV) and our codebookless model (CLM) on various image databases (panels include Caltech 101, Flickr Material and Pascal VOC2007).


The codebook, as a reference for feature coding, serves as a bridge between local features and the global image representation. However, it is well known that the partitioning of the feature space involved in building a codebook introduces quantization error, which has prompted continued efforts to mitigate this side effect (e.g., soft coding methods alleviate it but cannot completely eliminate it). Although performed offline, training a codebook, particularly a large one, is time consuming. In addition, a codebook pre-trained on one database generally cannot adapt naturally to other databases [52].

An alternative approach is to estimate statistics directly on the sets of local features extracted from input images [10, 35, 44], as illustrated in Fig. 2(b); we call this the codebookless model (CLM) in this paper. It is clear from Fig. 2 that the major difference is that the BoF model learns a codebook to explore the statistical distribution of local features and then codes the descriptors, whereas the CLM represents images with descriptors directly, requiring neither a pre-trained codebook nor the subsequent coding. Conceptually, the codebookless model has the potential to circumvent the aforementioned limitations of the BoF model; however, it has received little attention in the image classification community. The main reasons may be that such methods have not yet shown competitive classification performance, and that they often rely on inefficient and unscalable kernel-based classifiers.

Figure 2. Comparison between (a) the BoF model and (b) our CLM. The major difference between them is whether or not there is a pre-trained codebook and coding step. Our CLM mainly consists of a Gaussian model for image representation and joint low-rank learning with a linear SVM classifier.

In this paper, we propose an effective CLM scheme and argue that the CLM can be a competitive alternative to BoF methods for image classification. A comparison between a state-of-the-art BoF method, Fisher Vector (FV) [39], and our CLM on various image databases is shown in Fig. 1. First, we extract a set of local features (e.g., SIFT [34]) on a dense grid over the image and model them with a single Gaussian to represent the input image. Then, we employ a two-step metric for matching Gaussian models. With this metric, Gaussian models can be fed to a linear classifier, ensuring efficient and scalable classification while respecting the Riemannian geometry of Gaussian models. Moreover, we introduce two well-motivated parameters into the metric: one balances the effect of the mean and the covariance of the Gaussian, and the other performs eigenvalue power normalization on the covariance.

Our codebookless model is usually high-dimensional. By incorporating low-rank learning into the SVM, we propose a joint learning method that effectively compresses Gaussian models while respecting their Riemannian geometry. To the best of our knowledge, this is the first attempt to jointly learn a low-rank transformation and an SVM on the Gaussian manifold. Finally, to alleviate the side effect of background clutter, we propose a saliency-based partial background removal method to enhance our CLM. The experimental results show that partial background removal helps the CLM when images are heavily cluttered (e.g., CUB200-2011 and Pascal VOC2007).

2. Related work 

Codebookless models that directly model the statistics of local features have been studied in past decades. Rubner et al. [38] introduced signatures for image representation and proposed the Earth Mover's Distance for image matching, which is robust but computationally expensive. Tuzel et al. were the first to use covariance matrices to represent regular image regions, and employed the affine-Riemannian metric, which suffers from high computational cost [36]. The Gaussian model has been used as an image descriptor for visual tracking [19], where Gaussian models are matched with a Riemannian metric that involves expensive operations to solve a generalized eigenvalue problem. Going beyond a single Gaussian, the Gaussian mixture model (GMM) is more informative and has been used in image retrieval. However, GMMs suffer from limitations such as the high computational cost of matching and the lack of general criteria for model selection.

Our work is motivated by several previous methods. Carreira et al. [10] modeled the free-form regions obtained by image segmentation by estimating their second-order moments. By using the Log-Euclidean metric [2], this method can be combined with a linear classifier, and it has shown competitive recognition performance on images with little background clutter (e.g., Caltech101 [18]). Different from that method, we employ a Gaussian model to represent the whole image. It is well known that a covariance matrix can be seen as a Gaussian model with a fixed mean vector. Compared with such covariance-based representations, our CLM contains both first-order (mean) and second-order (covariance) information. Note that first-order statistics have proven important in image classification [25]. Moreover, the manifold of Gaussian models and that of covariance matrices are quite different, and the embedding method in our CLM allows Gaussian models to be handled flexibly and conveniently.

Nakayama et al. [35] also represented an image with a global Gaussian for scene categorization. However, they matched two Gaussian models using the Kullback-Leibler (KL) divergence, so kernel-based classifiers have to be used; this approach is not scalable and has high computational cost. In contrast to [35], our metric is decoupled, which allows it to be combined with a linear classifier and makes our method more efficient and scalable than the KL-kernel-based method in [35]. Moreover, compared with the ad-hoc linear kernel (a Euclidean baseline) in [35], our method takes advantage of the geometric structure of Gaussian models and brings a large performance improvement.

There is another line of research on codebookless methods. Grauman et al. [20] proposed a pyramid match kernel that maps feature sets to multi-resolution histograms, and employed the histogram intersection kernel for classification. Bo et al. presented efficient match kernels that map local features into a low-dimensional space, and adopted a linear classifier. Boiman et al. [6] developed an image-to-class distance between sets of local features, and employed a nearest neighbor classifier. Yao et al. [50] proposed a codebook-free approach that uses a large number of randomly generated image templates for image representation, and developed a bagging-based classifier.


3. Proposed method

We first introduce the image representation based on a single Gaussian model. Then, we employ an effective and efficient two-step metric for matching Gaussian models, and propose two well-motivated parameters to improve this distance metric. Finally, we present a method for jointly learning a low-rank transformation and an SVM on the Gaussian manifold.

3.1. Gaussian model for image representation

Given an input image, we extract a set of $N$ local features $\{x_i \in \mathbb{R}^{k \times 1}, i = 1, \ldots, N\}$ on a dense grid. By the maximum likelihood method, the image can be represented by the following Gaussian model:

$$\mathcal{N}(x_i \mid \mu, \Sigma) = \frac{\exp\left(-\frac{1}{2}(x_i - \mu)^T \Sigma^{-1} (x_i - \mu)\right)}{\sqrt{(2\pi)^k \det(\Sigma)}},$$

where $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$ and $\Sigma = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^T$ are the mean vector and covariance matrix, and $\det(\cdot)$ denotes the matrix determinant. Compared with histograms and covariance matrices, the Gaussian model is more informative. Meanwhile, unlike matching of signatures [38] or GMMs, matching of Gaussian models does not incur high computational cost.

3.2. Two-step metric between Gaussian models

To match Gaussian models, we exploit a two-step metric that was proposed to compute the ground distance between Gaussian components of GMMs [32]. The first step embeds the Gaussian manifold into the space of SPD matrices [33]; the second maps the Lie group of SPD matrices into its corresponding Lie algebra, a linear space, by using the Log-Euclidean metric [2].

The space of $k$-dimensional Gaussian models is a Riemannian manifold. Let $\mathcal{N}(\mu, \Sigma)$ be a Gaussian model with mean vector $\mu$ and covariance matrix $\Sigma$. Through a continuous function $\pi$, $\mathcal{N}(\mu, \Sigma)$ is mapped to an affine matrix, an element of the affine group $\{(\mu, P) \mid \mu \in \mathbb{R}^{k \times 1}, P \in \mathbb{R}^{k \times k}, \det(P) > 0\}$; that is,

$$\pi : \mathcal{N}(\mu, \Sigma) \mapsto A = \begin{bmatrix} P & \mu \\ 0^T & 1 \end{bmatrix}, \qquad (1)$$

where $\Sigma = P P^T$ is the Cholesky factorization of $\Sigma$. Further, through the function $\gamma : A \mapsto S = A A^T$, $A$ is mapped to an SPD matrix $S$. Thus, by the successive functions $\pi$ and $\gamma$, $\mathcal{N}(\mu, \Sigma)$ is uniquely designated as a $(k+1) \times (k+1)$ SPD matrix:

$$\mathcal{N}(\mu, \Sigma) \sim S = \begin{bmatrix} \Sigma + \mu\mu^T & \mu \\ \mu^T & 1 \end{bmatrix}. \qquad (2)$$

Please refer to [33] for details on the embedding process.

The space of $(k+1) \times (k+1)$ SPD matrices, $S_{k+1}^{+}$, is a Lie group that forms a Riemannian manifold. Two operations, namely the logarithmic multiplication and the scalar logarithmic multiplication, are defined in the Log-Euclidean metric [2]; they equip $S_{k+1}^{+}$ with the structure of not only a Lie group but also a vector space. Through the matrix logarithm, $S_{k+1}^{+}$ is mapped into its Lie algebra $S_{k+1}$, the vector space of $(k+1) \times (k+1)$ symmetric matrices. The matrix logarithm is a diffeomorphism and an isomorphism, so operations over SPD matrices can be replaced by Euclidean operations on their counterparts in the vector space. Hence, through the matrix logarithm, an SPD matrix $S$ is mapped one-to-one to a symmetric matrix $G$ that lies in a linear space, and the geodesic distance between SPD matrices $S_i$ and $S_j$ is defined as $\mathrm{dist}(S_i, S_j) = \|G_i - G_j\|_F$, where $\|\cdot\|_F$ is the Frobenius norm.
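The two-step metric of Sections 3.1 and 3.2 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments (which is written in Matlab): the function names are hypothetical, the random arrays merely stand in for dense local features, and the small `ridge` term follows the regularization mentioned in the implementation details.

```python
import numpy as np


def fit_gaussian(features, ridge=1e-3):
    """Maximum likelihood mean and covariance of an (N, k) array of local features.

    `ridge` is added to the covariance diagonal so that it stays positive definite.
    """
    mu = features.mean(axis=0)
    centered = features - mu
    sigma = centered.T @ centered / features.shape[0]
    return mu, sigma + ridge * np.eye(features.shape[1])


def embed_gaussian(mu, sigma):
    """Map N(mu, sigma) to the (k+1) x (k+1) SPD matrix of Eq. (2)."""
    k = mu.shape[0]
    S = np.empty((k + 1, k + 1))
    S[:k, :k] = sigma + np.outer(mu, mu)
    S[:k, k] = mu
    S[k, :k] = mu
    S[k, k] = 1.0
    return S


def spd_log(S):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(S)
    return (eigvecs * np.log(eigvals)) @ eigvecs.T


def gaussian_distance(gauss_i, gauss_j):
    """Frobenius distance between the log-embedded Gaussians."""
    G_i = spd_log(embed_gaussian(*gauss_i))
    G_j = spd_log(embed_gaussian(*gauss_j))
    return np.linalg.norm(G_i - G_j, ord="fro")


rng = np.random.default_rng(0)
image_a = rng.normal(size=(500, 8))           # stand-in for dense local features
image_b = rng.normal(loc=0.5, size=(500, 8))  # a second, shifted feature set
print(gaussian_distance(fit_gaussian(image_a), fit_gaussian(image_b)))
```

Because each image is mapped to its own matrix $G$ independently of any other image, the vectors can be precomputed once and passed to a linear classifier.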
3.3. Two well-motivated parameters

In practice, we found that it is important to balance the mean vector and covariance matrix in the embedding matrix (2), because their dimensions and the order of magnitude of each dimension may vary considerably. Moreover, the relative effect of the mean vector and covariance matrix may vary across tasks. With these considerations, we introduce a parameter $\beta > 0$ in the function $\pi$ of Eq. (1):

$$\pi(\beta) : \mathcal{N}(\mu, \Sigma) \mapsto A = \begin{bmatrix} P & \beta\mu \\ 0^T & 1 \end{bmatrix}. \qquad (3)$$

Accordingly, the embedding matrix has the following form:

$$\mathcal{N}(\mu, \Sigma) \sim S(\beta) = \begin{bmatrix} \Sigma + \beta^2 \mu\mu^T & \beta\mu \\ \beta\mu^T & 1 \end{bmatrix}. \qquad (4)$$

The embedding matrix (4) reduces to the covariance matrix when $\beta = 0$, and is equal to the original one when $\beta = 1$. Hence, the role of the mean vector and covariance matrix can be adjusted by $\beta$.

The maximum likelihood estimator of the empirical covariance matrix is susceptible to noise, especially in high-dimensional spaces [15]. Based on the observation that the maximum likelihood estimator of covariance can be improved by eigenvalue shrinkage [42], we apply power normalization to the eigenvalues of the covariance matrix (EPN). Let $\mathcal{N}(\mu, \Sigma)$ be a Gaussian model estimated from a set of descriptors extracted from an image. The covariance matrix $\Sigma$ has the eigenvalue decomposition $\Sigma = U \mathrm{diag}(\lambda_i) U^T$, where $U$ is an orthonormal matrix whose $i$-th column is an eigenvector of $\Sigma$, $\lambda_i > 0$ is the corresponding eigenvalue, and $\mathrm{diag}(\cdot)$ denotes a diagonal matrix. Then, by introducing a parameter $\rho$, our normalization is defined as

$$\Sigma^{\rho} = U \mathrm{diag}(\lambda_i^{\rho}) U^T, \quad \text{with } 0 < \rho < 1. \qquad (5)$$

With EPN, our final embedding matrix is:

$$\mathcal{N}(\mu, \Sigma) \sim S(\beta, \rho) = \begin{bmatrix} \Sigma^{\rho} + \beta^2 \mu\mu^T & \beta\mu \\ \beta\mu^T & 1 \end{bmatrix}. \qquad (6)$$

It is easy to prove that the embedding matrix (6) is still positive definite, i.e., it remains an SPD matrix. Eigenvalue power normalization has previously been proposed to measure distances between covariance matrices or tensors [29], namely as the Power-Euclidean metric. Different from previous work, we use eigenvalue power normalization for robust estimation of covariance matrices in the Gaussian setting for high-dimensional features, and compare Gaussians by using the Gaussian embedding and the Log-Euclidean metric.

According to the Log-Euclidean framework, the matrix $S(\beta, \rho)$ can be further embedded into a linear space by the matrix logarithm:

$$G(\beta, \rho) = \log(S(\beta, \rho)). \qquad (7)$$

Let $\mathcal{N}_i = \mathcal{N}(\mu_i, \Sigma_i)$ and $\mathcal{N}_j = \mathcal{N}(\mu_j, \Sigma_j)$ be two Gaussian models with corresponding symmetric matrices $G_i(\beta, \rho)$ and $G_j(\beta, \rho)$. The distance between the two Gaussian models is

$$\mathrm{dist}(\mathcal{N}_i, \mathcal{N}_j) = \|G_i(\beta, \rho) - G_j(\beta, \rho)\|_F. \qquad (8)$$

Distance (8) is decoupled, so $G_i(\beta, \rho)$ and $G_j(\beta, \rho)$ can be computed separately and used in a linear classifier. For notational simplicity, we omit the parameters $\beta$ and $\rho$ in the distance measure (8).

3.4. Joint low-rank learning and SVM classifier

Our CLM is usually high-dimensional ($> 10^4$). In order to suppress redundant and noisy information while reducing computational and storage cost, we propose a low-rank learning method to compact our CLM. The matrix $G$ in the geodesic distance (8) is a $(k+1) \times (k+1)$ symmetric matrix that lies in a Euclidean space. Owing to its symmetry, we can unfold the upper triangular part of $G$ into a vector of size $d = (k+1)(k+2)/2$. We can modify the geodesic distance (8) by introducing a low-rank transformation matrix $L \in \mathbb{R}^{d \times r}$, $r \ll d$:

$$\mathrm{dist}(\mathcal{N}_i, \mathcal{N}_j) = \|L^T (f_i - f_j)\|_2, \qquad (9)$$

where $f_i$ and $f_j$ are the unfolded vectors of the two Gaussian models $\mathcal{N}_i$ and $\mathcal{N}_j$, respectively.

Recent studies [26, 49] have shown that jointly optimizing dimensionality reduction and the classifier performs better than optimizing the two modules separately. Thus, given $N$ training samples $\{f_n, n \in [1, N]\}$, we optimize the low-rank learning jointly with a linear SVM (LRSVM):

$$\min_{L, w, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_{n=1}^{N} \xi_n \qquad (10)$$
$$\text{s.t.} \ \ y_n (w^T L^T f_n + b) \geq 1 - \xi_n, \ \ \xi_n \geq 0, \ \ n \in [1, N], \quad L^T L = I,$$

where $w$, $\xi$, $b$ are the parameters of the SVM, and $y_n$ is the label of $f_n$. Dimensionality reduction for SPD matrices has been studied in [23] with dimensionality reduction and classification performed separately; our method is quite different in that we focus on Gaussian models and jointly learn the low-rank transformation and the SVM.

In practice, we extend the objective function (10) to the multi-class problem under the spatial pyramid matching (SPM) framework [30]. Given an image $I_n$, we can obtain its SPM representation $F_n = [(f_n^1)^T, \ldots, (f_n^B)^T]^T$, where $B$ is the number of blocks in SPM, which is fed to a one-vs-all SVM for solving the $M$-class problem. As suggested in [26], we optimize the dual problem of the objective function (10) under the SPM framework:

$$\min_{L} \max_{\alpha} \ \sum_{m=1}^{M} \left( \sum_{n=1}^{N} \alpha_n^m - \frac{1}{2} \alpha_m^T Y_m F H F^T Y_m \alpha_m \right) \qquad (11)$$
$$\text{s.t.} \ \ \sum_{n=1}^{N} y_n^m \alpha_n^m = 0, \ \ 0 \leq \alpha_n^m \leq C, \ \forall m, \quad L^T L = I, \ \ L^T = \mathrm{Diag}(L_1^T, \ldots, L_B^T), \ \ H = L L^T,$$

where $F = [F_1, \ldots, F_N]^T$ contains all training features, $\alpha_m$ is the vector of dual variables $\alpha_n^m$ of the $m$-th class, and $Y_m$ is the diagonal label matrix of the $m$-th class with diagonal elements $Y_m(n, n) = y_n^m$.

Problem (11) is non-convex and can be optimized by a two-step alternating method. Step one: fixing $L$, we optimize the Lagrange parameters $\alpha_m$ with an off-the-shelf SVM. Step two: for fixed $\alpha_m$, we solve the following trace maximization problem:

$$\max_{L} \ \mathrm{tr}\left( L^T F^T \Big( \sum_{m=1}^{M} Y_m \alpha_m \alpha_m^T Y_m \Big) F L \right) \qquad (12)$$
$$\text{s.t.} \ \ L^T L = I, \ \ L^T = \mathrm{Diag}(L_1^T, \ldots, L_B^T).$$

We optimize problem (12) by independently solving for each $L_i$, $i = 1, \ldots, B$, with a closed-form solution [26]. Because problem (11) is non-convex, initialization matters both for reaching a good local optimum and for fast convergence. In this paper, we use the basis of principal component analysis (PCA) as initialization, and we find that it always achieves good performance and fast convergence.
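The embedding with the balance parameter $\beta$ and EPN parameter $\rho$ (Section 3.3), followed by the upper-triangular unfolding used in Section 3.4, can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the defaults $\beta = 0.4$ and $\rho = 0.5$ are simply the values reported for CUB200-2011 in Section 6.1, and the $\sqrt{2}$ scaling of off-diagonal entries is an assumption made here so that the Euclidean distance between unfolded vectors equals the Frobenius distance between the matrices.

```python
import numpy as np


def log_embedding(mu, sigma, beta=0.4, rho=0.5):
    """Symmetric matrix G(beta, rho) = log S(beta, rho) for N(mu, sigma).

    `sigma` is assumed to be positive definite (regularize it beforehand).
    """
    lam, U = np.linalg.eigh(sigma)
    lam = np.maximum(lam, 0.0)                  # guard against round-off
    sigma_rho = (U * lam**rho) @ U.T            # eigenvalue power normalization
    k = mu.shape[0]
    S = np.empty((k + 1, k + 1))                # embedding with beta and EPN
    S[:k, :k] = sigma_rho + beta**2 * np.outer(mu, mu)
    S[:k, k] = beta * mu
    S[k, :k] = beta * mu
    S[k, k] = 1.0
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T                # matrix logarithm


def unfold(G):
    """Unfold the upper triangle of a symmetric matrix into a vector of size d."""
    rows, cols = np.triu_indices(G.shape[0])
    # Assumption: off-diagonal entries are scaled by sqrt(2) to preserve norms.
    scale = np.where(rows == cols, 1.0, np.sqrt(2.0))
    return G[rows, cols] * scale


k = 6
rng = np.random.default_rng(1)
B_i, B_j = rng.normal(size=(k, k)), rng.normal(size=(k, k))
G_i = log_embedding(rng.normal(size=k), B_i @ B_i.T + 1e-3 * np.eye(k))
G_j = log_embedding(rng.normal(size=k), B_j @ B_j.T + 1e-3 * np.eye(k))
f_i, f_j = unfold(G_i), unfold(G_j)
print(f_i.shape, np.linalg.norm(f_i - f_j), np.linalg.norm(G_i - G_j, ord="fro"))
```

With this scaling, a low-rank matrix $L$ applied to the unfolded vectors acts directly on a space whose Euclidean distance is distance (8).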

4. Partial background removal (PBR)

We then present a simple yet effective method for analyzing and handling the side effect of background clutter, based on unsupervised, bottom-up saliency detection. Our purpose here is to remove the interference of the background, which differs from the goal of precise foreground localization pursued in the saliency detection community. Our method consists of two steps: coarse foreground detection and partial background removal. In the first step, we localize the foreground in the image with a saliency detection method [27] and determine the bounding box surrounding it. Next, we adaptively expand the bounding box to accommodate some background regions, based on the size and intensity variance of the area inside the bounding box. The area outside the bounding box is then removed before recognition. Our method is based on the considerations that accurate foreground detection is currently very difficult, and that regions neighboring the object can serve as context and may be helpful for recognition. In our experiments, we apply PBR to the two datasets with heavy background clutter: CUB200-2011 and VOC2007. Since PBR is designed for foreground objects with separable background clutter, we do not apply PBR to images with little background clutter or to scene images, where both foreground and background are valuable for scene understanding.
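The two steps of PBR can be illustrated with the following minimal sketch. It assumes that a non-negative saliency map of the same size as the image is already available from some detector, and it replaces the adaptive expansion rule (which depends on the size and intensity variance of the box) with a fixed fractional margin; the function name, `thresh` and `margin` are all hypothetical and chosen only for illustration.

```python
import numpy as np


def partial_background_removal(image, saliency, thresh=0.5, margin=0.1):
    """Crop `image` to an expanded bounding box around the salient pixels.

    `thresh` (a fraction of the peak saliency) and the fixed fractional `margin`
    are hypothetical stand-ins for the adaptive rule described in the text.
    """
    # Step 1: coarse foreground detection and its tight bounding box.
    mask = saliency >= thresh * saliency.max()
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    top, bottom = rows[0], rows[-1] + 1
    left, right = cols[0], cols[-1] + 1
    # Step 2: expand the box to keep some surrounding context, then crop.
    pad_h = int(round(margin * (bottom - top)))
    pad_w = int(round(margin * (right - left)))
    top, bottom = max(top - pad_h, 0), min(bottom + pad_h, image.shape[0])
    left, right = max(left - pad_w, 0), min(right + pad_w, image.shape[1])
    return image[top:bottom, left:right]


image = np.arange(100 * 100).reshape(100, 100)
saliency = np.zeros((100, 100))
saliency[40:60, 30:70] = 1.0                     # a synthetic salient region
print(partial_background_removal(image, saliency).shape)
```

Local features would then be extracted only from the cropped region before the Gaussian model is estimated.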

5. Implementation details 

We extract multi-scale SIFT descriptors [34] (the standard pipeline in the BoF model) with cell size $2^i$, $i = 1, 2, \ldots$, and single-scale pixel-wise covariance descriptors [44] via dense sampling with a step length of 2. The dense covariance descriptors are computed from 17-dimensional raw features, including intensity and four kinds of first-order and second-order gradients from [37]. We apply the matrix logarithm to the covariance descriptors (LogCov), which are then vectorized. The SIFT features are computed with the VLFeat library [46]. Moreover, following [9, 10], we also extract additional image cues, including color, location, scale, gradient and entropy, and concatenate them with SIFT and LogCov. To ensure that there is sufficient data to estimate Gaussian models and that covariance matrices are positive definite, we require the smaller of the image width and height to be larger than 64 and add $10^{-3}$ to the diagonal entries of covariance matrices, respectively. We employ the spatial pyramid strategy [30], which divides an image into regular regions (e.g., 1 x 1, 2 x 2, 1 x 3, 4 x 4). For each region we compute a Gaussian model, and then concatenate the models to represent the whole image. Each Gaussian is weighted by $\frac{1/N_l}{\sum_{i=1}^{L} 1/N_i}$, where $L$ and $N_l$ are the number of pyramid levels and the number of regions in the $l$-th level, respectively. We implement a one-vs-all SVM with LibSVM and set the parameter $C$ to 0.01 on VOC2007 and 10 on all the other databases. All algorithms are written in Matlab and run on a PC equipped with an i7-4770K CPU and 32 GB of RAM.
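The spatial pyramid weighting can be made concrete with a small sketch. It assumes that each Gaussian in level $l$ receives the weight $(1/N_l) / \sum_{i=1}^{L} 1/N_i$, with $N_l$ the number of regions in level $l$; this reading of the weighting rule, the function names and the toy dimensions are assumptions made for illustration.

```python
import numpy as np


def pyramid_weights(regions_per_level):
    """Per-level weights (1/N_l) / sum_i (1/N_i); the formula is an assumption."""
    inv = 1.0 / np.asarray(regions_per_level, dtype=float)
    return inv / inv.sum()


def pyramid_representation(region_vectors, regions_per_level):
    """Concatenate weighted region vectors, ordered level by level."""
    per_region = np.repeat(pyramid_weights(regions_per_level), regions_per_level)
    if len(region_vectors) != per_region.size:
        raise ValueError("expected one vector per pyramid region")
    return np.concatenate([w * f for w, f in zip(per_region, region_vectors)])


levels = [1, 4, 16]                              # 1 x 1, 2 x 2 and 4 x 4 grids
vectors = [np.ones(5) for _ in range(sum(levels))]
print(pyramid_weights(levels))
print(pyramid_representation(vectors, levels).shape)
```

Under this assumption the weights of the levels sum to one, and every region within a level shares the same weight, so coarser levels contribute more per region.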

6. Experimental evaluation 

In this section, we evaluate the classification performance of our CLM on eight benchmark databases. First, we analyze the local features, the parameters of our method, the proposed low-rank learning method and the partial background removal method on the challenging CUB200-2011 [47]. Then, we compare with state-of-the-art methods on Caltech101 [18], Caltech256 [21], KTH-TIPS2b [8], Flickr Material Database (FMD) [22], Pascal VOC2007 [17], Scene15 [30] and Sports8 [31]. Finally, we analyze the computational complexity of our CLM.

6.1. Parameters analysis 

Local descriptors. Four kinds of local descriptors, SIFT (ST) and its enrichment (eST), and LogCov (LC) and its enrichment (eLC), are evaluated in this section. The results of our CLM with various local descriptors on CUB200-2011 are shown in Table 1. We can see that the Gaussian model used in our method outperforms the covariance matrix by 1.5% or more with either SIFT or eSIFT, which we believe indicates that the first-order (mean) information is non-trivial. We use eST to evaluate the other parameters below.

Table 1. Classification results (in %) of our CLM for various combinations of descriptors, parameters and background removal on CUB200-2011.

Two well-motivated parameters. The proposed EPN (5) is a generic method for robust estimation of covariance in high-dimensional spaces. We set the parameter $\rho$ in EPN (5) to 0.5 on all databases. From Table 1 we can see that EPN brings a 1.2% performance gain over the corresponding method without EPN. The embedding parameter $\beta$ balances the effect of the mean vector and covariance matrix. To test its effect, we determine the optimal value of $\beta$ via cross validation. The performance of our CLM with various $\beta$ is illustrated in Fig. 3 (left). Compared with $\beta = 0$ (covariance matrix only, as in [10]) and $\beta = 1$ (the embedding in [33]), appropriate balancing at $\beta = 0.4$ achieves gains of 2.4% and 0.9%, respectively.

LRSVM. To evaluate the proposed LRSVM method, we compare it with unsupervised principal component analysis (PCA) and supervised partial least squares (PLS) under different compression ratios. LRSVM is initialized by PCA, and the results on CUB200-2011 are illustrated in Fig. 3 (right). We can see that LRSVM always performs better than PLS and is superior to PCA by a large margin. Unlike PLS, which uses the least squares loss, LRSVM uses the hinge loss. We attribute the improvement to the joint learning of dimensionality reduction and classifier. Note that the larger the compression ratio, the larger the improvement of LRSVM over PCA and PLS. Meanwhile, LRSVM suffers only a small performance loss (less than 1.5%) at large compression ratios (> 100). We can also see that LRSVM slightly improves the performance of our CLM when the compression ratio is smaller (< 80), which we attribute to LRSVM suppressing some noisy information. In general, we set the compression ratio to between 80 and 100 to balance efficiency and effectiveness.
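Step two of the alternating optimization in Section 3.4 is a trace maximization under an orthogonality constraint. The sketch below assumes the closed-form solution is the standard one for such problems, namely the eigenvectors of the largest $r$ eigenvalues (a consequence of Ky Fan's theorem), and treats a single pyramid block ($B = 1$); the function name and the random test data are hypothetical.

```python
import numpy as np


def trace_max_basis(F, alphas, labels, r):
    """Orthonormal L (d x r) maximizing tr(L^T A L) for one pyramid block.

    F:      (N, d) training vectors, one row per sample.
    alphas: (M, N) dual variables, one row per one-vs-all classifier.
    labels: (M, N) entries in {-1, +1}; row m holds the diagonal of Y_m.
    """
    V = (labels * alphas) @ F              # row m equals (F^T Y_m alpha_m)^T
    A = V.T @ V                            # sum_m F^T Y_m alpha_m alpha_m^T Y_m F
    eigvals, eigvecs = np.linalg.eigh(A)   # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :r]         # eigenvectors of the r largest ones


rng = np.random.default_rng(2)
N, d, M, r = 40, 12, 5, 3
F = rng.normal(size=(N, d))
alphas = rng.uniform(0.0, 1.0, size=(M, N))
labels = rng.choice([-1.0, 1.0], size=(M, N))
L = trace_max_basis(F, alphas, labels, r)
print(L.shape, np.allclose(L.T @ L, np.eye(r)))
```

In the full procedure this step would alternate with retraining the SVM on the projected features $L^T f_n$, starting from a PCA basis.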

Impact of PBR. We apply PBR to CUB200-2011, and the results are presented in Table 1. We can see that the method using PBR achieves large gains (more than 7.5%) over the one without PBR. Note that we achieve a gain of about 1% on VOC2007 by using PBR, which suggests that our PBR is a general method for handling background in the CLM. The gains achieved by using the ground truth (GT) bounding box indicate that more advanced background removal methods could further improve the recognition performance of our CLM. Compared with the improvement on CUB200-2011, the gains on VOC2007 are relatively small. The main reasons are that saliency-based methods fail to locate the foreground precisely in such challenging databases, and that CUB200-2011 contains only one object per image whereas an image in VOC2007 may contain multiple objects. PBR cannot segment an image into multiple objects, so multi-object images heavily affect the performance of the CLM.

Figure 3. Effect of the balance parameter $\beta$ (left) and comparison of PCA, PLS and our LRSVM with various compression ratios on CUB200-2011 (right).

6.2. Comparison with state-of-the-art methods 

We compare our CLM with more than ten state-of-the-art methods on eight widely used benchmarks. The descriptions and experimental setup of these benchmarks are listed in Table 2. We report the results in Table 3 and discuss them as follows.

Comparison of various local descriptors. We combine our CLM with four kinds of local descriptors and assess them on all databases. From Table 3 we can see that SIFT and LogCov achieve comparable results. For object recognition, LogCov is superior to SIFT on CUB200-2011 and VOC2007, while SIFT outperforms LogCov on Caltech101 and Caltech256. For scene categorization, SIFT and LogCov obtain similar performance on both Sports8 and Scene15. For texture and material classification, SIFT achieves gains over LogCov on KTH-TIPS2b, while LogCov is superior to SIFT by a large margin on FMD. eSIFT and eLogCov follow the same pattern as SIFT and LogCov, respectively. Enriching SIFT and LogCov considerably boosts the performance of our CLM, which encourages us to use more informative descriptors for further improvement.

Table 2. Descriptions and experimental setup on eight widely used benchmarks. A check mark indicates that the corresponding kind of variation is present in the database.

| Database | Classes | Images in total | Training/Test | Measurement | Scale | View | Illumination | Pose | Bg clutter | Occlusion |
|---|---|---|---|---|---|---|---|---|---|---|
| CUB200-2011 [47] | 200 | 11,788 | Split in [47] | Acc. of split | ✓ | | | | | |
| Caltech101 [18] | 102 | 9,144 | 30/remaining per class | Acc. of 5 runs | ✓ | | | ✓ | | |
| Caltech256 [21] | 256 | 30,607 | 30/remaining per class | Acc. of 5 runs | ✓ | | | ✓ | | ✓ |
| Sports8 [31] | 8 | 1,792 | 70/60 per class | Acc. of 5 runs | | ✓ | ✓ | ✓ | ✓ | |
| KTH-TIPS2b [8] | 11 | 4,752 | (illegible) | Acc. of splits | ✓ | ✓ | ✓ | | | |
| FMD [22] | 10 | 1,000 | 50/50 per class | Acc. of 5 runs | ✓ | ✓ | ✓ | | | |
| VOC2007 [17] | 20 | 9,963 | Split in [17] | mAP of split | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Scene15 [30] | 15 | 4,485 | 100/remaining per class | Acc. of 5 runs | | ✓ | ✓ | | ✓ | |

Table 3. Comparison (in %) with state-of-the-art methods on eight widely used benchmark datasets.

| (a) CUB200-2011 | Acc. | (b) Caltech101 | Acc. (Tr. = 30) | (c) Caltech256 | Acc. (Tr. = 30) | (d) Sports8 | Acc. |
|---|---|---|---|---|---|---|---|
| BoF-hard [30] | 18.6 | FV+SIFT [39] | 80.8 ± 0.3 | BoF-LLC [48] | 41.2 | FV+SIFT [39] | 91.3 ± 1.3 |
| FV [39] | 25.8 | FV+eSIFT | 83.7 ± 0.3 | FV+SIFT [39] | 47.4 ± 0.1 | FV+eSIFT | 90.4 ± 1.2 |
| FV+eSIFT | 27.3 | DeCAF [14] | 86.9 ± 0.7 | FV+eSIFT | 50.1 ± 0.3 | Kobayashi2014 [28] | 92.6 ± 0.7 |
| Kobayashi2014 [28] | 27.3 | O2P+eSIFT [10] | 80.8 | Kobayashi2014 [28] | 49.8 ± 0.1 | GG (ad-linear) [35] | 80.2 |
| PPKI5T1 | 28.2 | SQ-O2P+SIFT [7] | 79.5 | NBNN [6] | 43 | GG (ct-linear) [35] | 82.9 ± 1.0 |
| CLM (SIFT) | 18.6 | NBNN [6] | 77.8 ± 0.3 | M-HMP [4] | 50.7 | GG + KL Div. [35] | 84.4 ± 1.4 |
| CLM (eSIFT) | 28.1 | CLM (SIFT) | 84.9 ± 0.1 | CLM (SIFT) | 48.9 ± 0.2 | CLM (SIFT) | 88.8 ± 1.0 |
| CLM (LogCov) | 19.1 | CLM (eSIFT) | 86.3 ± 0.3 | CLM (eSIFT) | 53.6 ± 0.2 | CLM (eSIFT) | 91.5 ± 1.2 |
| CLM (eLogCov) | 28.6 | CLM (LogCov) | 82.5 ± 0.3 | CLM (LogCov) | 48.6 ± 0.3 | CLM (LogCov) | 88.3 ± 1.3 |
| CLM (eSIFT) + PBR | 36.0 | CLM (eLogCov) | 84.7 ± 0.2 | CLM (eLogCov) | 53.2 ± 0.1 | CLM (eLogCov) | 90.7 ± 0.7 |

| (e) KTH-TIPS2b | Acc. | (f) FMD | Acc. | (g) VOC2007 | mAP | (h) Scene15 | Acc. |
|---|---|---|---|---|---|---|---|
| BoF-LLC [48] | 57.6 ± 2.3 | VLAD [25] | 52.6 ± 1.5 | BoF-LLC [48] | 57.4 | SV [53] | 85.0 |
| VLAD [25] | 63.1 ± 1.0 | FV+SIFT [39] | 58.3 ± 1.0 | SV [53] | 58.2 | FV+SIFT [39] | 88.1 ± 0.2 |
| FV+SIFT [39] | 69.3 ± 1.0 | FV+eSIFT | 58.9 ± 1.7 | SQ-O2P+SIFT [7] | 51.0 | FV+eSIFT | 89.4 ± 0.2 |
| FV+eSIFT | 71.3 ± 3.1 | Kobayashi2014 [28] | 57.3 ± 0.9 | FV+SIFT [39] | 61.8 | GG (ad-linear) [35] | 79.8 |
| DeCAF [14] | 70.7 ± 1.7 | DeCAF [14] | 60.7 ± 2.1 | FV+eSIFT | 60.8 | GG (ct-linear) [35] | 82.3 ± 0.4 |
| Attributes [13] | 73.8 ± 1.3 | Attributes [13] | 61.1 ± 1.4 | Kobayashi2014 [28] | 63.8 | GG + KL Div. [35] | 86.1 ± 0.5 |
| CLM (SIFT) | 71.8 ± 3.1 | CLM (SIFT) | 51.6 ± 1.2 | CLM (SIFT) | 55.8 | CLM (SIFT) | 88.1 ± 0.4 |
| CLM (eSIFT) | 75.2 ± 2.6 | CLM (eSIFT) | 57.7 ± 1.6 | CLM (eSIFT) | 60.4 | CLM (eSIFT) | 89.4 ± 0.4 |
| CLM (LogCov) | 72.2 ± 3.3 | CLM (LogCov) | 62.4 ± 1.5 | CLM (LogCov) | 56.6 | CLM (LogCov) | 88.3 ± 0.6 |
| CLM (eLogCov) | 73.6 ± 2.6 | CLM (eLogCov) | 64.2 ± 1.0 | CLM (eLogCov) | 61.7 | CLM (eLogCov) | 89.2 ± 0.5 |

Comparison with counterparts. Here, we compare our CLM with its counterparts: O2P [10], Global Gaussian (GG) [35] and NBNN [6]. As shown in Tables 1 and 3, our CLM significantly outperforms O2P [10] on CUB200-2011 and Caltech101, and is also superior to its variant with sparse quantization (SQ-O2P) [7] on Caltech101 and VOC2007 by a large margin, which is mainly due to the appropriate use of mean information and EPN. Moreover, our CLM performs much better than the GG methods [35] with the ad-hoc linear kernel (ad-linear), the center tangent linear kernel (ct-linear) and the KL divergence on Sports8 and Scene15. The ad-linear kernel can be seen as a baseline in Euclidean space. It is worth mentioning that the methods in [35] use probabilistic discriminant analysis (PDA) as the classifier. If an SVM is used, their results drop to 71.7%, 78.8% and 81.4% on Sports8, and 74.3%, 80.7% and 83.1% on Scene15, respectively. We attribute the gains of our CLM over [35] to the use of the two-step metric with the proposed well-motivated parameters. We also compare our CLM with NBNN [6]. It is easy to see that our CLM performs much better than NBNN on Caltech101 and Caltech256. The main differences between our CLM and NBNN are that our CLM employs an effective model-to-model distance and an SVM classifier.
Comparison with FV We make a comprehensive comparison 
with one state-of-the-art BoF method, FV [39], on all 
databases, and also apply enriched SIFT (eSIFT) to FV. On 
all databases except FMD, our CLM achieves performance 
better than or comparable to that of FV when SIFT or eSIFT 
is used. On FMD, with SIFT or eSIFT, our CLM is inferior to 
FV, but with LogCov or eLogCov, our CLM is much better than 
FV. In our experiments, we find that LogCov and eLogCov are 
not well suited to FV, so the corresponding results are not 
reported. We also find that our CLM is more sensitive to 
local descriptors than FV: eSIFT brings little or no gain 
to FV, whereas our CLM benefits greatly from the enrichment 
of SIFT or LogCov. 
Comparison with other state-of-the-art methods Some 
recent results are also presented for comparison. On 
Caltech101, DeCAF [14] with a 6-layer CNN and the dropout 
strategy [41] slightly outperforms our CLM. Without dropout, 
the result of DeCAF drops to 84.8%. On Caltech256, our 
CLM outperforms the deep architecture Multipath Hierarchical 
Matching Pursuit (M-HMP) [4] by 2.9%. Cimpoi 
et al. [13] achieved state-of-the-art results on KTH-TIPS2b 
and FMD with semantic attributes which are trained on an 
additional database by combining FV [39] and DeCAF [14]. 
Our CLM is superior to the attribute-based method, FV and 
DeCAF when each is used individually. By combining attribute 
features, FV and DeCAF, Cimpoi et al. [13] obtained 77.3% 
and 67.1% accuracy on KTH-TIPS2b and FMD. Kobayashi [28] 
proposed a histogram transformation method, which achieves 
state-of-the-art results on Sports8 and VOC2007. 

Summary In this paper, we assess our CLM on eight image 
benchmarks, as shown in Table 0 which contains various 
transformations or noisy factors. We claim that (1) the 
results on Caltech101 and Caltech256 show that our CLM 
copes well with location and pose variations of objects; (2) 
the results on FMD and KTH-TIPS2b show that our CLM 
is robust to scale, viewpoint, illumination and appearance 
variations; (3) the results on Sports8 and Scene15 indicate 
that our CLM can classify scene images well despite a 
certain amount of background clutter; and (4) the results on 
CUB200-2011 and VOC2007 demonstrate that our CLM can also 
handle images with complex surroundings, such as heavy 
background clutter and occlusion. 

6.3. Computational complexity analysis 

Our CLM for classification mainly consists of three 
components: extracting local descriptors, computing Gaussian 
models followed by EPN and the matrix logarithm, and 
learning LRSVM for classification. Most of the computational 
cost of CLM lies in the eigenvalue decompositions required 
by EPN and the matrix logarithm. Their computational 
complexities are $O(k^3)$ and $O((k+1)^3)$, respectively, 
where $k$ is the dimension of the local descriptors. During 
joint training of the low-rank matrix and the SVM 
classifier, optimizing the objective function alternates 
between an SVM minimization problem and a trace minimization 
problem, whose complexity is $O(J(N^2 D + D^3 + B d^3))$, 
where $N$ is the number of training samples of dimension 
$D = Bd$, and $J$ is the number of iterations, which is less 
than 3 in our experiments. 
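
To make the two cubic-cost steps concrete, the following minimal sketch computes a Gaussian from a set of local descriptors, applies an eigenvalue power normalization to the covariance, and takes the matrix logarithm of a symmetric positive definite (SPD) embedding. It is an illustration under stated assumptions and not the implementation evaluated in this paper: the function name, the values of beta, rho and eps, and the particular block form of the (k+1) x (k+1) embedding matrix are all assumptions made for the example.

```python
import numpy as np


def gaussian_log_embedding(descriptors, beta=0.5, rho=0.75, eps=1e-6):
    """Map local descriptors of shape (n, k) to a symmetric (k+1) x (k+1) matrix.

    beta, rho and eps are illustrative values, not taken from the paper.
    """
    X = np.asarray(descriptors, dtype=float)
    n, k = X.shape

    # First- and second-order statistics of the descriptors.
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = Xc.T @ Xc / max(n - 1, 1) + eps * np.eye(k)

    # Eigenvalue power normalization: one k x k eigendecomposition, O(k^3).
    w, V = np.linalg.eigh(cov)
    w = np.clip(w, eps, None)
    cov_epn = (V * w ** rho) @ V.T

    # Assumed SPD embedding of the Gaussian; its Schur complement is cov_epn,
    # so the embedded matrix stays positive definite.
    A = np.empty((k + 1, k + 1))
    A[:k, :k] = cov_epn + beta ** 2 * np.outer(mu, mu)
    A[:k, k] = beta * mu
    A[k, :k] = beta * mu
    A[k, k] = 1.0

    # Matrix logarithm via a second eigendecomposition, O((k+1)^3).
    w2, V2 = np.linalg.eigh(A)
    w2 = np.maximum(w2, np.finfo(float).tiny)  # guard against log of non-positive values
    return (V2 * np.log(w2)) @ V2.T
```

Each call performs one eigendecomposition of size k and one of size k+1, which is where the two cubic terms above come from.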

Here, we report empirical running times, taking KTH-TIPS2b 
and Caltech101 as examples. Computing the image 
representations, which includes extracting SIFT at multiple 
scales and computing the Gaussian models and embedding 
matrices, takes 30 minutes on KTH-TIPS2b and 1.5 hours on 
Caltech101. Modeling one image takes about 0.4 seconds and 
0.6 seconds on average on the respective databases. For each 
trial, training (resp. testing) of LRSVM takes 20s (resp. 
2s) on KTH-TIPS2b and 7min (resp. 40s) on Caltech101. 

7. Discussion and conclusion 

The bag-of-features (BoF) model is a popular method in the 
classification and recognition fields, and has demonstrated 
convincing performance in many computer vision tasks in the 
past years. It might seem that codebook training and 
descriptor coding are indispensable ingredients. However, 
the codebookless model (CLM) proposed in this work has 
proven to be an effective alternative to the BoF methods 
for image classification. Below we discuss why CLM shows 
such competitive performance. 

Different from the BoF methods, our CLM leverages 
continuous functions for statistical modeling of local 
descriptors, which requires no codebook and thus involves no 
quantization. Recent research [12] showed that 
high dimensionality can bring impressive performance. The 
state-of-the-art BoF methods such as SV/VLAD or FV have 
inherently high dimensionality, which, in our opinion, is the 
key to characterizing the distinctness and discriminativeness 
of individual images as well as image categories. Our CLM 
directly employs the first- and second-order statistics of 
high-dimensional local descriptors, giving rise to 
informative image-level models of high dimensionality as 
well. In this respect, it is worthwhile to study more 
informative or higher-dimensional CLMs. Moreover, as shown 
earlier, the CLM is more efficient than the BoF methods for 
modeling images because codebook learning and coding are not 
necessary. In addition, the CLM may be more suitable for 
tasks where the datasets are regularly updated or enlarged, 
since in that case the codebook in the BoF model has to 
be regularly adjusted to fit the changing datasets. 

The contributions of this paper are summarized as follows. 

(1) Our work has clearly shown that the CLM is a very 
competitive alternative to the mainstream BoF model. We hope 
our work can raise interest in the classification 
(or retrieval) community and pave the way for future research. 

(2) Our method enables Gaussian models to be successfully 
combined with a linear SVM classifier, which makes 
our method scalable and efficient. The key is that we embed 
Gaussian models into a vector space, which also allows us 
to perform joint low-rank learning and SVM on the Gaussian 
manifold. Meanwhile, the two proposed well-motivated 
parameters further improve our CLM. 

(3) We performed extensive experiments, evaluating various 
aspects of our CLM and comparing it with its counterparts as 
well as state-of-the-art methods. The comprehensive 
experiments demonstrated the promising performance of our 
CLM. 
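
The reason a vector-space embedding allows a plain linear classifier can be illustrated with a short sketch. For symmetric matrices, such as the matrix logarithm of an SPD embedding, stacking the upper triangle with off-diagonal entries scaled by sqrt(2) gives a vector whose dot product equals the Frobenius inner product of the original matrices, so a linear kernel on the vectors respects the matrix geometry. The helper name sym_to_vec and the sqrt(2) scaling convention are assumptions for illustration; the exact vectorization used in this paper is not reproduced here.

```python
import numpy as np


def sym_to_vec(S):
    """Vectorize a symmetric matrix, scaling off-diagonal entries by sqrt(2)."""
    S = np.asarray(S, dtype=float)
    rows, cols = np.triu_indices(S.shape[0])
    scale = np.where(rows == cols, 1.0, np.sqrt(2.0))
    return S[rows, cols] * scale


# Two random symmetric matrices standing in for log-embedded Gaussians.
rng = np.random.default_rng(0)
M1 = rng.normal(size=(5, 5))
M2 = rng.normal(size=(5, 5))
S1 = (M1 + M1.T) / 2
S2 = (M2 + M2.T) / 2

# Frobenius inner product of the matrices versus dot product of the vectors.
frobenius_inner = np.trace(S1 @ S2)
linear_kernel = sym_to_vec(S1) @ sym_to_vec(S2)
```

The two quantities agree up to floating-point rounding, and the vector has (k+1)(k+2)/2 entries for a (k+1) x (k+1) matrix.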

References 

[1] J. Arenas-García, K. B. Petersen, and L. K. Hansen. Sparse 
kernel orthonormalized PLS for feature extraction in large 
data sets. In NIPS, 2006. 

[2] V. Arsigny, P. Fillard, X. Pennec, and N. Ayache. Fast and 
simple calculus on tensors in the Log-Euclidean framework. 
In MICCAI, 2005. 

[3] C. Beecks, A. M. Zimmer, S. Kirchhoff, and T. Seidl. 
Modeling image similarity by gaussian mixture models and the 
signature quadratic form distance. In ICCV, 2011. 

[4] L. Bo, X. Ren, and D. Fox. Multipath sparse coding using 
hierarchical matching pursuit. In CVPR, 2013. 

[5] L. Bo and C. Sminchisescu. Efficient match kernel between 
sets of features for visual recognition. In NIPS, 2009. 


[6] O. Boiman, E. Shechtman, and M. Irani. In defense of 
nearest-neighbor based image classification. In CVPR, 2008. 

[7] X. Boix, G. Roig, S. Diether, and L. V. Gool. Self-adaptable 
templates for feature coding. In NIPS, 2014. 

[8] B. Caputo, E. Hayman, and P. Mallikarjuna. Class-specific 
material categorisation. In ICCV, 2005. 

[9] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. 
Semantic Segmentation with Second-Order Pooling. In ECCV, 
2012. 

[10] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. 
Free-Form Region Description with Second-Order Pooling. 
TPAMI, PP:1, 2014. 

[11] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support 
vector machines. ACM TIST, 2(3):27, 2011. 

[12] D. Chen, X. Cao, F. Wen, and J. Sun. Blessing of 
dimensionality: High-dimensional feature and its efficient 
compression for face verification. In CVPR, 2013. 

[13] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and 
A. Vedaldi. Describing textures in the wild. In CVPR, 2014. 

[14] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, 
E. Tzeng, and T. Darrell. DeCAF: A deep convolutional 
activation feature for generic visual recognition. In ICML, 
2014. 

[15] D. L. Donoho, M. Gavish, and I. M. Johnstone. Optimal 
shrinkage of eigenvalues in the spiked covariance model. 
arXiv, 1311.0851, 2014. 

[16] I. Dryden, A. Koloydenko, and D. Zhou. Non-euclidean 
statistics for covariance matrices, with applications to 
diffusion tensor imaging. Annals of Applied Statistics, 2009. 

[17] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, 
and A. Zisserman. The Pascal Visual Object Classes (VOC) 
Challenge. IJCV, 88(2):303-338, 2010. 

[18] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of 
object categories. TPAMI, 28(4):594-611, 2006. 

[19] L. Gong, T. Wang, and F. Liu. Shape of gaussians as feature 
descriptors. In CVPR, 2009. 

[20] K. Grauman and T. Darrell. The pyramid match kernel: 
Discriminative classification with sets of image features. In 
ICCV, 2005. 

[21] G. Griffin, A. Holub, and P. Perona. The Caltech-256. 
Technical report, California Institute of Technology, 2007. 

[22] L. Sharan, R. Rosenholtz, and E. H. Adelson. Material 
perception: What can you see in a brief glance? Jour. of Vis., 
9(8):784, 2009. 

[23] M. T. Harandi, M. Salzmann, and R. Hartley. From manifold 
to manifold: Geometry-aware dimensionality reduction for 
spd matrices. In ECCV, 2014. 

[24] S. Jayasumana, R. Hartley, M. Salzmann, H. Li, and 
M. Harandi. Kernel methods on the riemannian manifold of 
symmetric positive definite matrices. In CVPR, 2013. 

[25] H. Jegou, M. Douze, C. Schmid, and P. Perez. Aggregating 
local descriptors into a compact image representation. In 
CVPR, 2010. 

[26] S. Ji and J. Ye. Linear dimensionality reduction for 
multi-label classification. In IJCAI, 2009. 

[27] B. Jiang, L. Zhang, H. Lu, C. Yang, and M.-H. Yang. 
Saliency detection via absorbing markov chain. In ICCV, 
2013. 


[28] T. Kobayashi. Dirichlet-based histogram feature transform 
for image classification. In CVPR, 2014. 

[29] P. Koniusz, F. Yan, P.-H. Gosselin, and K. Mikolajczyk. 
Higher-order Occurrence Pooling on Mid- and Low-level 
Features: Visual Concept Detection. Technical report, 2013. 

[30] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of 
features: Spatial pyramid matching for recognizing natural 
scene categories. In CVPR, 2006. 

[31] L.-J. Li and F.-F. Li. What, where and who? classifying 
events by scene and object recognition. In ICCV, 2007. 

[32] P. Li, Q. Wang, and L. Zhang. A novel earth mover’s 
distance methodology for image matching with gaussian 
mixture models. In ICCV, 2013. 

[33] M. Lovric, M. Min-Oo, and E. A. Ruh. Multivariate normal 
distributions parametrized as a riemannian symmetric space. 
JMVA, 74(1):36-48, 2000. 

[34] D. G. Lowe. Distinctive image features from scale-invariant 
keypoints. IJCV, 60(2):91-110, 2004. 

[35] H. Nakayama, T. Harada, and Y. Kuniyoshi. Global gaussian 
approach for scene categorization using information 
geometry. In CVPR, 2010. 

[36] X. Pennec, P. Fillard, and N. Ayache. A riemannian 
framework for tensor computing. IJCV, pages 41-66, 2006. 

[37] W. K. Pratt. Digital Image Processing, 4th Edition. John 
Wiley & Sons, Inc., New York, NY, USA, 2007. 

[38] Y. Rubner, C. Tomasi, and L. J. Guibas. The Earth Mover’s 
Distance as a metric for image retrieval. IJCV, 40(2):99-121, 
2000. 

[39] J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek. Image 
classification with the Fisher vector: Theory and practice. 
IJCV, 105(3):222-245, 2013. 

[40] J. Sivic and A. Zisserman. Video Google: A text retrieval 
approach to object matching in videos. In ICCV, 2003. 

[41] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and 
R. Salakhutdinov. Dropout: A simple way to prevent neural 
networks from overfitting. JMLR, 15:1929-1958, 2014. 

[42] C. Stein. Lectures on the theory of estimation of many 
parameters. Jour. of Math. Sci., 34(1):1373-1403, 1986. 

[43] V. Sydorov, M. Sakurada, and C. H. Lampert. Deep fisher 
kernels - end to end learning of the Fisher kernel GMM 
parameters. In CVPR, 2014. 

[44] O. Tuzel, F. Porikli, and P. Meer. Region covariance: A fast 
descriptor for detection and classification. In ECCV, 2006. 

[45] J. van Gemert, C. J. Veenman, A. W. M. Smeulders, 
and J.-M. Geusebroek. Visual word ambiguity. TPAMI, 
32(7):1271-1283, 2010. 

[46] A. Vedaldi and B. Fulkerson. VLFeat: An open 
and portable library of computer vision algorithms. 
http://www.vlfeat.org/, 2008. 

[47] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. 
The Caltech-UCSD Birds-200-2011 Dataset. Technical 
report, 2011. 

[48] J. Wang, J. Yang, K. Yu, F. Lv, T. S. Huang, and Y. Gong. 
Locality-constrained linear coding for image classification. 
In CVPR, 2010. 

[49] J. Weston, S. Bengio, and N. Usunier. Wsabie: Scaling up to 
large vocabulary image annotation. In IJCAI, 2011. 




[50] B. Yao, G. Bradski, and L. Fei-Fei. A codebook-free and 
annotation-free approach for fine-grained image categoriza¬ 
tion. In CVPR, 2012. 

[51] N. Zhang, R. Farrell, and T. Darrell. Pose pooling kernels for 
sub-category recognition. In CVPR, 2012. 

[52] W. Zhou, M. Yang, H. Li, X. Wang, Y. Lin, and Q. Tian. 
Towards codebook-free: Scalable cascaded hashing for mobile 
image search. TMM, 16(3):601-611, 2014. 

[53] X. Zhou, K. Yu, T. Zhang, and T. S. Huang. Image 
classification using super-vector coding of local image 
descriptors. In ECCV, 2010. 




